Slotted array vectorized #376

Merged
merged 19 commits into from
Jul 26, 2024

@Scooletz Scooletz commented Jul 24, 2024

This PR changes the way SlottedArray uses vectorized search.

Previously, it used Span.IndexOf over a set of slots. This meant searching through 2x more data, with potential (but quite unlikely) false-positive hits when Slot.Raw was equal to the hash. Additionally, when a hash collision occurred, it required multiple calls to Span.IndexOf to move the search across the map.
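
The cost of that approach can be seen in a small scalar model. This is a Python sketch, not the C# code from the PR: it assumes each slot is a 2-byte hash followed by a 2-byte raw value, flattened into one list, and `find_interleaved` and `is_match` are hypothetical names. `list.index` stands in for Span.IndexOf.

```python
def find_interleaved(flat, wanted_hash, is_match):
    """Search a flattened [hash0, raw0, hash1, raw1, ...] list.

    Returns (slot_index or None, number of index calls made).
    The interleaved layout is an assumption made for illustration.
    """
    start, calls = 0, 0
    while True:
        calls += 1
        try:
            # Scans hashes AND raw values, so twice the data is searched.
            i = flat.index(wanted_hash, start)
        except ValueError:
            return None, calls
        # An odd position is a raw value that happens to equal the hash:
        # a false positive that has to be filtered out.
        if i % 2 == 0 and is_match(i // 2):
            return i // 2, calls
        # False positive or hash collision: restart the search one past the hit.
        start = i + 1


if __name__ == "__main__":
    # Slots: (0x11, 0xAA), (0x22, 0x33), (0x33, 0xBB), (0x33, 0xCC)
    flat = [0x11, 0xAA, 0x22, 0x33, 0x33, 0xBB, 0x33, 0xCC]
    print(find_interleaved(flat, 0x33, lambda slot: slot == 3))
```

In this example the first hit is a raw value and the second is a colliding hash, so the lookup needs three separate search calls before it reaches the right slot.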

This PR changes that by aligning SlottedArray to the biggest vector allowed on the given platform: Vector256 for x64 and Vector128 for ARM. It introduces chunks, each made of one vector of hashes followed by one vector of the raw parts of the slots, which provide the markers and the pointer within the page. For x64, this requires allocating 32 bytes for hashes + 32 bytes for raw = 64 bytes per chunk. There's no other overhead besides some slots potentially being left unused (max 60 bytes). This approach allows scanning the hashes in a nicely vectorized way, without dealing with false positives or checking whether a given value sits at a hash offset. Then, if the search is not done, it jumps over the vector of Slots to the next vector of hashes.
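
The chunk arithmetic can be checked with a short sketch. It assumes a 2-byte hash and a 2-byte raw value per slot (which is consistent with the 60-byte figure: 15 unused slots of 4 bytes each) and ignores the header; the function names are hypothetical and this is Python, not the C# from the PR.

```python
HASH_BYTES = 2  # assumption: each hash is a 16-bit value
RAW_BYTES = 2   # assumption: each Slot.Raw is a 16-bit value


def chunk_geometry(vector_bytes):
    """Return (slots per chunk, bytes per chunk) for a given vector width."""
    lanes = vector_bytes // HASH_BYTES
    return lanes, 2 * vector_bytes  # one vector of hashes + one vector of raws


def slot_offsets(index, vector_bytes):
    """Byte offsets of the hash and the raw part of slot `index`,
    relative to the first chunk (the header is not modeled)."""
    lanes, chunk_bytes = chunk_geometry(vector_bytes)
    chunk, lane = divmod(index, lanes)
    base = chunk * chunk_bytes
    return base + lane * HASH_BYTES, base + vector_bytes + lane * RAW_BYTES


def max_unused_bytes(vector_bytes):
    """Worst case: the last chunk holds a single used slot."""
    lanes, _ = chunk_geometry(vector_bytes)
    return (lanes - 1) * (HASH_BYTES + RAW_BYTES)


if __name__ == "__main__":
    print(chunk_geometry(32))    # Vector256
    print(chunk_geometry(16))    # Vector128
    print(slot_offsets(17, 32))  # second chunk, second lane
    print(max_unused_bytes(32))
```

Under these assumptions a Vector256 chunk holds 16 slots in 64 bytes, and a Vector128 chunk holds 8 slots in 32 bytes.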

The advantages are the following:

  1. 2x fewer comparisons
  2. better hash collision handling (if they occur in the same vector)
  3. no remainder handling as everything is vector aligned
  4. a simple and tight loop
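
The loop these points describe can be modeled in scalar Python. This is a sketch under assumptions, not the PR's implementation: hashes and raw values are taken to be 16-bit little-endian, the chunk is the 64-byte x64 one, `build`, `match_mask` and `try_find` are hypothetical names, and masking off unused lanes via a slot count is an illustrative choice. The per-lane comparison in `match_mask` stands in for a single vector compare that yields a bitmask.

```python
import struct

LANES = 16          # 32-byte vector / 2-byte hash (assumed sizes)
VECTOR_BYTES = 32
CHUNK_BYTES = 64    # vector of hashes + vector of raw values


def build(slots):
    """Lay out (hash, raw) pairs as chunks: 16 hashes, then 16 raw values."""
    chunks = (len(slots) + LANES - 1) // LANES
    buf = bytearray(chunks * CHUNK_BYTES)
    for i, (h, raw) in enumerate(slots):
        chunk, lane = divmod(i, LANES)
        base = chunk * CHUNK_BYTES
        struct.pack_into("<H", buf, base + 2 * lane, h)
        struct.pack_into("<H", buf, base + VECTOR_BYTES + 2 * lane, raw)
    return buf


def match_mask(buf, chunk, wanted):
    """Scalar stand-in for one vector compare: bit n is set if lane n matches."""
    lanes = struct.unpack_from("<16H", buf, chunk * CHUNK_BYTES)
    mask = 0
    for lane, h in enumerate(lanes):
        if h == wanted:
            mask |= 1 << lane
    return mask


def try_find(buf, count, wanted, is_match):
    """Return the index of the first slot whose hash matches and whose raw
    value passes `is_match`, or None."""
    chunks = (count + LANES - 1) // LANES
    for chunk in range(chunks):
        used = min(LANES, count - chunk * LANES)
        # Only hash lanes are compared, so raw values can never produce a hit.
        mask = match_mask(buf, chunk, wanted) & ((1 << used) - 1)
        while mask:
            lane = (mask & -mask).bit_length() - 1  # lowest set bit
            offset = chunk * CHUNK_BYTES + VECTOR_BYTES + 2 * lane
            raw = struct.unpack_from("<H", buf, offset)[0]
            if is_match(raw):
                return chunk * LANES + lane
            mask &= mask - 1  # collision within the same vector: next bit
        # No match here: skip the raw vector and move on to the next chunk.
    return None


if __name__ == "__main__":
    slots = [(i + 1, 1000 + i) for i in range(20)]
    slots[3] = (7, 1003)    # collides with slot 6 (hash 7)
    slots[18] = (7, 1018)   # same hash again, in the second chunk
    print(try_find(build(slots), len(slots), 7, lambda raw: raw == 1018))
```

Collisions that fall in the same chunk are resolved from one mask without a second scan, and every chunk is a full vector, so there is no tail case to handle.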

Vector alignment was tested on x64 but showed no positive impact in benchmarks.

TODO List

  • support Vector128 for ARM and others (throws on scalar)
  • fix the hidden bug
  • add benchmarks
    • hash collision
    • odd nibble path (no impact on construction, rather on Prepare/Unprepare)
    • path longer than 4 so that the NibblePath comparison happens (this will compare equals only)
    • restore defragmentation benchmark (not needed)
    • restore iteration benchmark
  • consider materializing the NibblePath in TryFind or before so that the comparison above works as expected (left for future work)
  • amend design doc

Benchmarks

The new benchmarks use aligned memory to make them truly comparable, and they show great results, especially when the key is not found among the initial keys and the search requires more iterations.

TryGet

Before
| Method | sliceFrom | length | index | odd   | Mean      | Error     | StdDev    | Code Size |
|--------|-----------|--------|-------|-------|-----------|-----------|-----------|-----------|
| TryGet | ?         | ?      | 1     | False | 7.180 ns  | 0.1448 ns | 0.1355 ns | 4,068 B   |
| TryGet | ?         | ?      | 15    | False | 7.644 ns  | 0.1158 ns | 0.0967 ns | 4,033 B   |
| TryGet | ?         | ?      | 16    | False | 8.136 ns  | 0.1283 ns | 0.1200 ns | 4,033 B   |
| TryGet | ?         | ?      | 31    | False | 8.841 ns  | 0.1750 ns | 0.1637 ns | 4,098 B   |
| TryGet | ?         | ?      | 32    | False | 9.113 ns  | 0.0999 ns | 0.0934 ns | 4,079 B   |
| TryGet | ?         | ?      | 47    | False | 9.284 ns  | 0.0931 ns | 0.0871 ns | 4,057 B   |
| TryGet | ?         | ?      | 48    | False | 10.058 ns | 0.0464 ns | 0.0412 ns | 4,100 B   |
| TryGet | ?         | ?      | 63    | False | 10.383 ns | 0.1024 ns | 0.0957 ns | 4,035 B   |
| TryGet | ?         | ?      | 64    | False | 10.979 ns | 0.0954 ns | 0.0893 ns | 4,036 B   |
| TryGet | ?         | ?      | 95    | False | 12.936 ns | 0.1134 ns | 0.1006 ns | 4,083 B   |
| TryGet | ?         | ?      | 96    | False | 11.370 ns | 0.1728 ns | 0.1616 ns | 4,096 B   |
After
| Method | sliceFrom | length | index | odd   | Mean      | Error     | StdDev    | Code Size |
|--------|-----------|--------|-------|-------|-----------|-----------|-----------|-----------|
| TryGet | ?         | ?      | 1     | False | 7.512 ns  | 0.1016 ns | 0.0951 ns | 3,310 B   |
| TryGet | ?         | ?      | 15    | False | 7.455 ns  | 0.0670 ns | 0.0627 ns | 3,266 B   |
| TryGet | ?         | ?      | 16    | False | 7.903 ns  | 0.0643 ns | 0.0601 ns | 3,253 B   |
| TryGet | ?         | ?      | 31    | False | 7.816 ns  | 0.0808 ns | 0.0756 ns | 3,253 B   |
| TryGet | ?         | ?      | 32    | False | 8.174 ns  | 0.0827 ns | 0.0774 ns | 3,240 B   |
| TryGet | ?         | ?      | 47    | False | 8.147 ns  | 0.0778 ns | 0.0650 ns | 3,220 B   |
| TryGet | ?         | ?      | 48    | False | 8.464 ns  | 0.1305 ns | 0.1157 ns | 3,229 B   |
| TryGet | ?         | ?      | 63    | False | 8.487 ns  | 0.1186 ns | 0.1110 ns | 3,186 B   |
| TryGet | ?         | ?      | 64    | False | 8.922 ns  | 0.1604 ns | 0.1501 ns | 3,217 B   |
| TryGet | ?         | ?      | 95    | False | 9.197 ns  | 0.1391 ns | 0.1233 ns | 3,171 B   |
| TryGet | ?         | ?      | 96    | False | 10.232 ns | 0.1698 ns | 0.1588 ns | 3,175 B   |
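
To read the two TryGet tables side by side, the relative change in mean time per index can be computed from the reported means. The snippet below only restates the numbers above; `change_percent` is a hypothetical helper, and a negative value means the new layout is faster.

```python
# Mean TryGet times in nanoseconds, copied from the Before/After tables.
before = {1: 7.180, 15: 7.644, 16: 8.136, 31: 8.841, 32: 9.113, 47: 9.284,
          48: 10.058, 63: 10.383, 64: 10.979, 95: 12.936, 96: 11.370}
after = {1: 7.512, 15: 7.455, 16: 7.903, 31: 7.816, 32: 8.174, 47: 8.147,
         48: 8.464, 63: 8.487, 64: 8.922, 95: 9.197, 96: 10.232}


def change_percent(old, new):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for index in before:
        delta = change_percent(before[index], after[index])
        print(f"index {index:>2}: {delta:+.1f}%")
```

The gain grows with the index: the lookup at index 1 is slightly slower, while lookups deep into the map (for example index 95) improve the most.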

TryGet_With_Hash_Collisions

Before
| Method                      | index | Mean     | Error    | StdDev   | Code Size |
|-----------------------------|-------|----------|----------|----------|-----------|
| TryGet_With_Hash_Collisions | 1     | 23.79 ns | 0.168 ns | 0.157 ns | 4,458 B   |
| TryGet_With_Hash_Collisions | 2     | 14.18 ns | 0.172 ns | 0.161 ns | 4,338 B   |
| TryGet_With_Hash_Collisions | 3     | 23.95 ns | 0.096 ns | 0.090 ns | 4,497 B   |
| TryGet_With_Hash_Collisions | 4     | 14.22 ns | 0.120 ns | 0.112 ns | 4,345 B   |
| TryGet_With_Hash_Collisions | 30    | 15.72 ns | 0.237 ns | 0.222 ns | 4,316 B   |
| TryGet_With_Hash_Collisions | 31    | 23.88 ns | 0.193 ns | 0.180 ns | 4,496 B   |
After
| Method                      | index | Mean     | Error    | StdDev   | Code Size |
|-----------------------------|-------|----------|----------|----------|-----------|
| TryGet_With_Hash_Collisions | 1     | 21.75 ns | 0.171 ns | 0.151 ns | 5,598 B   |
| TryGet_With_Hash_Collisions | 2     | 14.82 ns | 0.297 ns | 0.278 ns | 5,551 B   |
| TryGet_With_Hash_Collisions | 3     | 21.62 ns | 0.232 ns | 0.205 ns | 5,649 B   |
| TryGet_With_Hash_Collisions | 4     | 14.73 ns | 0.094 ns | 0.088 ns | 5,553 B   |
| TryGet_With_Hash_Collisions | 30    | 15.31 ns | 0.204 ns | 0.191 ns | 5,528 B   |
| TryGet_With_Hash_Collisions | 31    | 22.44 ns | 0.212 ns | 0.188 ns | 5,606 B   |

SlottedArray upgraded design

                                                                   
 ┌─────────────────┬────────────────────┬─────────────────────────┐  
 │ HEADER          │ VECTOR of Hashes   │ VECTOR of Slots         │  
 │                 │                    │                         │  
 │                 │                    │                         │  
 │                 │                    │                         │  
 ├─────────────────┴──────┬─────────────┴───────┬─────────────────┤  
 │ VECTOR of Hashes       │ VECTOR of Slots     │                 │  
 │                        │                     │                 │  
 │                        │                     │                 │  
 │                        │                     │                 │  
 ├────────────────────────┴─────────────────────┘                 │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                            ┌─────────┬─────────────────────────┤  
 │                            │         │                         │  
 │                            │         │                         │  
 │                            │  DATA   │                DATA     │  
 │                            │for slot1│              for slot 0 │  
 │                            │         │                         │  
 └────────────────────────────┴─────────┴─────────────────────────┘  
                                                                      

@Scooletz Scooletz added the 🐌 performance (Performance related issue) and 💥Breaking (The change introduces a storage breaking change.) labels Jul 24, 2024

Code Coverage

| Package | Line Rate         | Branch Rate       | Health |
|---------|-------------------|-------------------|--------|
| Paprika | 84%               | 79%               |        |
| Summary | 84% (4224 / 4999) | 79% (1336 / 1701) |        |

Minimum allowed line rate is 75%

@Scooletz Scooletz marked this pull request as ready for review July 26, 2024 10:50
@Scooletz Scooletz merged commit d91122a into main Jul 26, 2024
2 checks passed
@Scooletz Scooletz deleted the slotted-array-vectorized branch July 26, 2024 10:51